MILC: Inverted List Compression in Memory

نویسندگان

  • Jianguo Wang
  • Chunbin Lin
  • Ruining He
  • Moojin Chae
  • Yannis Papakonstantinou
  • Steven Swanson
چکیده

Inverted list compression is a topic that has been studied for 50 years due to its fundamental importance in numerous applications including information retrieval, databases, and graph analytics. Typically, an inverted list compression algorithm is evaluated on its space overhead and query processing time. Earlier list compression designs mainly focused on minimizing the space overhead to reduce expensive disk I/O time in disk-oriented systems. But the recent trend is shifted towards reducing query processing time because the underlying systems tend to be memory-resident. Although there are many highly optimized compression approaches in main memory, there is still a considerable performance gap between query processing over compressed lists and uncompressed lists, which motivates this work. In this work, we set out to bridge this performance gap for the first time by proposing a new compression scheme, namely, MILC (memory inverted list compression). MILC relies on a series of techniques including offset-oriented fixedbit encoding, dynamic partitioning, in-block compression, cache-aware optimization, and SIMD acceleration. We conduct experiments on three real-world datasets in information retrieval, databases, and graph analytics to demonstrate the high performance and low space overhead of MILC. We compareMILC with 12 recent compression algorithms and experimentally show that MILC improves the query performance by up to 13.2× and reduces the space overhead by up to 4.7×.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Re-Pair Compression of Inverted Lists

Compression of inverted lists with methods that support fast intersection operations is an active research topic. Most compression schemes rely on encoding differences between consecutive positions with techniques that favor small numbers. In this paper we explore a completely different alternative: We use Re-Pair compression of those differences. While Re-Pair by itself offers fast decompressi...

متن کامل

Efficient Term Extraction and Indexing Approach in Small-Scale Web Search of Uyghur Language

In order to avoid the frequently read-write of hard disk and to speed up the search, the index should be saving in the memory in the small-scale web search. But, to express the original information by fewer memory spaces, also needs for index compression, and this would increases the computation expenses or brings certain harm to the original information in a way. In this research of Uyghur sma...

متن کامل

Algorithms and data structures for in-memory text search engines

In the current information era, everyone has instant access to huge amounts of textual data. Unfortunately, the information is often unstructured and difficult to handle. So the use of search engines is indispensable. Different aspects of search engines are well-established but still active areas of research. Across the relevant literature, inverted index data structures have been proved to be ...

متن کامل

I Inverted Index Compression

The data structure at the core of nowadays largescale search engines, social networks, and storage architectures is the inverted index. Given a collection of documents, consider for each distinct term t appearing in the collection the integer sequence `t , listing in sorted order all the identifiers of the documents (docIDs in the following) in which the term appears. The sequence `t is called ...

متن کامل

Positional Data Organization and Compression in Web Inverted Indexes

To sustain the tremendous workloads they suffer on a daily basis, Web search engines employ highly compressed data structures known as inverted indexes. Previous works demonstrated that organizing the inverted lists of the index in individual blocks of postings leads to significant efficiency improvements. Moreover, the recent literature has shown that the current state-of-the-art compression s...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • PVLDB

دوره 10  شماره 

صفحات  -

تاریخ انتشار 2017